Search for: All records

Creators/Authors contains: "Fernando, Milinda"


  1. Simulations to calculate a single gravitational waveform can take several weeks, yet thousands of such simulations are needed for the detection and interpretation of gravitational waves (GWs), and future detectors will require even more accurate waveforms than those in current use. We present here the first large-scale, adaptive-mesh, multi-GPU numerical relativity (NR) code, together with performance analysis and benchmarking. While direct comparisons are difficult to make, our GPU extension of the Dendro-GR NR code achieves a 6x speedup over existing state-of-the-art codes. We achieve 800 GFlop/s on a single NVIDIA A100 GPU, an overall 2.5x speedup over an equivalent CPU implementation on a two-socket, 128-core AMD EPYC 7763 node. We present detailed performance analyses, parallel scalability results, and accuracy assessments for GWs computed for mass ratios q = 1, 2, 4. We also demonstrate strong scaling up to 8 A100s and weak scaling up to 229,376 x86 cores on the Texas Advanced Computing Center's Frontera system.
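A back-of-the-envelope illustration of how the reported figures relate, as a minimal C++ sketch: the flop count and timings below are hypothetical, chosen only to reproduce the abstract's 800 GFlop/s and 2.5x ratios, and are not measurements from the paper.

```cpp
#include <cstdio>

int main() {
    // Hypothetical inputs -- not the paper's raw data.
    double flops_in_region = 4.0e12;  // floating-point ops in the timed region
    double gpu_time_s      = 5.0;     // wall-clock time on one A100 (assumed)
    double cpu_node_time_s = 12.5;    // 128-core EPYC 7763 node, same work (assumed)

    double gpu_gflops = flops_in_region / gpu_time_s / 1.0e9;  // -> 800 GFlop/s
    double speedup    = cpu_node_time_s / gpu_time_s;          // -> 2.5x

    std::printf("GPU throughput: %.0f GFlop/s\n", gpu_gflops);
    std::printf("Speedup over CPU node: %.1fx\n", speedup);
    return 0;
}
```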
  2. Efficiently and accurately simulating partial differential equations (PDEs) in and around arbitrarily defined geometries, especially with high levels of adaptivity, has significant implications for many application domains. A key bottleneck in this process is the fast construction of a ‘good’ adaptively refined mesh. In this work, we present an efficient, novel, octree-based adaptive discretization approach capable of carving out arbitrarily shaped void regions from the parent domain: an essential requirement for fluid simulations around complex objects. Carving out objects produces an incomplete octree, and we develop efficient top-down and bottom-up traversal methods to perform finite element computations on incomplete octrees. We validate the framework by (a) presenting appropriate convergence analyses and (b) computing the drag coefficient for flow past a sphere across a wide range of Reynolds numbers (O(1) to O(10^6)) encompassing the drag-crisis regime. Finally, as part of a current project to evaluate COVID-19 transmission risk in classrooms, we deploy the framework on a realistic geometry.
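The traversal idea can be sketched in a few lines of C++. This assumes a simple pointer-based octree with a hypothetical OctNode type; Dendro's actual octrees are linear (array-based) and distributed, so this illustrates skipping carved subtrees rather than the paper's implementation.

```cpp
#include <array>
#include <functional>
#include <memory>

struct OctNode {
    bool carved = false;                          // lies inside a void (object) region
    std::array<std::unique_ptr<OctNode>, 8> kids; // all empty => leaf
    bool isLeaf() const { return !kids[0]; }
};

// Top-down traversal of an incomplete octree: visit only leaves that belong
// to the fluid domain, skipping subtrees carved out by the immersed geometry.
void traverse(const OctNode& n, const std::function<void(const OctNode&)>& visit) {
    if (n.carved) return;            // hole in the octree: no work here
    if (n.isLeaf()) { visit(n); return; }
    for (const auto& k : n.kids)
        if (k) traverse(*k, visit);
}

int main() {
    OctNode root;
    for (auto& k : root.kids) k = std::make_unique<OctNode>();
    root.kids[3]->carved = true;     // carve one octant out of the domain
    int leaves = 0;
    traverse(root, [&](const OctNode&) { ++leaves; });
    return leaves == 7 ? 0 : 1;      // 7 of the 8 octants remain
}
```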
  3. Numerically solving partial differential equations (PDEs) remains a compelling application of supercomputing resources. The next generation of computing resources, exhibiting increased parallelism and deep memory hierarchies, provides an opportunity to rethink how we solve PDEs, especially time-dependent PDEs. Here, we consider time as an additional dimension and simultaneously solve for the unknown in large blocks of time (i.e., in 4D space-time), instead of the standard approach of sequential time-stepping. We discretize the 4D space-time domain using a mesh-free kD tree construction that enables good parallel performance as well as on-the-fly construction of adaptive 4D meshes. To make the best use of the 4D space-time mesh adaptivity, we invoke concepts from PDE analysis to establish rigorous a posteriori error estimates for a general class of PDEs. We solve canonical linear as well as nonlinear PDEs (heat diffusion, advection-diffusion, and Allen-Cahn) in space-time and illustrate the following advantages: (a) sustained scaling across larger processor counts than sequential time-stepping approaches, (b) the ability to capture "localized" behavior in space and time using the adaptive space-time mesh, and (c) removal of time-stepping constraints such as the Courant-Friedrichs-Lewy (CFL) condition, together with the ability to use spatially varying time-steps. We believe that the algorithmic and mathematical developments shown in this work, along with efficient deployment on modern architectures, constitute an important step towards improving the scalability of PDE solvers on the next generation of supercomputers.
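To make the removed constraint concrete, the following minimal C++ sketch evaluates the explicit-stepping CFL bound dt <= C dx / |v|; the cfl_dt helper and all numbers are illustrative assumptions. Under sequential time-stepping on an adaptive mesh, the finest cell dictates dt for the whole grid, which is precisely the restriction the space-time formulation lifts.

```cpp
#include <cmath>
#include <cstdio>

// Largest stable explicit time step for advection speed v on cell spacing dx.
double cfl_dt(double dx, double v, double courant = 0.5) {
    return courant * dx / std::fabs(v);
}

int main() {
    // With sequential time-stepping, dt(fine) would be imposed everywhere.
    double dx_coarse = 1.0 / 64.0, dx_fine = 1.0 / 4096.0, v = 1.0;
    std::printf("dt(coarse) = %.2e, dt(fine) = %.2e\n",
                cfl_dt(dx_coarse, v), cfl_dt(dx_fine, v));
    return 0;
}
```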
  4. We present a portable and highly scalable framework that targets problems in the astrophysics and numerical relativity communities. The framework combines the parallel Dendro octree with wavelet adaptive multiresolution and an automatically code-generated physics module to solve the Einstein equations of general relativity in the BSSNOK formulation. The goal of this work is to perform advanced, massively parallel numerical simulations of binary black hole and neutron star mergers, including intermediate mass ratio inspirals (IMRIs) of binary black holes with mass ratios on the order of 100:1. These simulations will produce waveforms for use in LIGO data analysis and for calibrating approximate methods of generating gravitational waveforms. The key contribution of this work is the development of automatic code generators for computational relativity supporting SIMD vectorization, OpenMP, and CUDA, combined with efficient distributed-memory adaptive data structures. These have enabled the development of efficient codes that demonstrate excellent weak scalability up to 131K cores on ORNL's Titan for binary mergers with mass ratios up to 100.
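A minimal C++ sketch of the kind of refinement test wavelet adaptive multiresolution relies on: predict a fine-level value from coarser neighbors and refine where the prediction error (the detail, or wavelet, coefficient) is large. The 1-D linear prediction and the needsRefinement helper are simplifying assumptions; a production code would use higher-order, multi-dimensional operators.

```cpp
#include <cmath>

// Refinement test for the midpoint between coarse samples fLeft and fRight,
// using linear interpolation as a (simplified) prediction operator.
bool needsRefinement(double fLeft, double fMid, double fRight, double tol) {
    double predicted = 0.5 * (fLeft + fRight);      // coarse-grid prediction
    double wavelet   = std::fabs(fMid - predicted); // detail coefficient
    return wavelet > tol;
}

int main() {
    bool smooth = needsRefinement(1.0, 1.01, 1.02, 1e-3); // smooth data: keep coarse
    bool kink   = needsRefinement(0.0, 1.00, 0.00, 1e-3); // kink: refine here
    return (!smooth && kink) ? 0 : 1;
}
```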
  5. Load balancing and partitioning are critical for parallel computations. Popular partitioning strategies based on space-filling curves focus on equally dividing work, and the partitions produced are independent of the architecture and the application. Given the ever-increasing relative cost of data movement and the increasing heterogeneity of our architectures, it is no longer sufficient to consider only an equal partitioning of work; minimizing communication costs is equally, if not more, important. Our hypothesis is that an unequal partitioning that significantly reduces communication costs can scale and perform better than conventional equal-work partitioning schemes, with a tradeoff that depends on both the architecture and the application. We validate this hypothesis in the context of a finite element computation using adaptive mesh refinement. Our central contribution is a new partitioning scheme that minimizes the overall runtime of subsequent computations by performing architecture- and application-aware non-uniform work assignment, decreasing time to solution primarily by minimizing data movement. We evaluate the algorithm against standard space-filling-curve-based partitioning algorithms, observing both time to solution and energy to solution for finite element computations on adaptively refined meshes. We demonstrate excellent scalability of the new partitioning algorithm up to 262,144 cores on ORNL's Titan and show that the proposed scheme reduces both overall energy and time to solution for application codes by up to 22.0%.
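For contrast, here is a minimal C++ sketch of the conventional baseline the paper compares against: cutting SFC-ordered octants into equal-work pieces via a prefix sum over per-octant weights. The equalWorkSplits function is a hypothetical simplification; the paper's contribution, skewing these cuts using architecture and application information to reduce communication, is not reproduced here.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Given octant work weights in space-filling-curve order, return the index of
// the first octant owned by each of p ranks (greedy equal-work splitting).
std::vector<size_t> equalWorkSplits(const std::vector<double>& w, int p) {
    double total = std::accumulate(w.begin(), w.end(), 0.0);
    std::vector<size_t> starts(p, w.size());
    double prefix = 0.0;
    int rank = 0;
    for (size_t i = 0; i < w.size(); ++i) {
        while (rank < p && prefix >= rank * total / p) starts[rank++] = i;
        prefix += w[i];
    }
    return starts;
}

int main() {
    std::vector<double> w = {1, 1, 4, 1, 1, 4, 1, 1}; // non-uniform octant costs
    for (size_t s : equalWorkSplits(w, 4)) std::printf("%zu ", s);
    std::printf("\n"); // prints the SFC cut points, e.g. "0 3 4 6"
    return 0;
}
```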